1.  Introduction


Every year, the U.S. Army Research Laboratory (ARL), in conjunction with the Programming
Environment and Training (PET) component of the Department of Defense High-Performance
Computing Modernization Program (DOD HPCMP), hosts several interns. These interns are
always college students, most of whom are either juniors or seniors. Their principal duties
involve carrying out research in high-performance computing (HPC) and/or the related
disciplines of high-performance networking and scientific visualization. This report is based on
the work and experiences of Earlene L. Thompson during her internship in the summer of 2005.

The goal of this internship was to perform preliminary work with Unified Parallel C (UPC)
and/or Co-Array Fortran (CAF). Neither of us had any background in either of these languages,
although both of us had experience using C. Additionally, the mentor had significant
experience using Fortran, parallelizing programs, and working with UNIX/LINUX-based systems.
The rationale for choosing these languages has to do with the Defense Advanced Research
Projects Agency (DARPA), whose High Productivity Computing Systems (HPCS) project aims
to develop PetaFLOP computers. There are currently three vendors working on this project. In
addition to developing hardware, the vendors are required to develop new programming
environments to facilitate the use of the emerging platforms. One of the vendors is Cray, Inc.,
which has selected these two languages as part of its proposal. While the languages have not
been finalized (adopted by a national/international standards-setting body), they appeared to
have matured to the point that this would be a viable project.

With this background, it was decided that the project would consist of porting a meaningful
program to one of the two languages. At the time, porting from High-Performance Fortran
(HPF) to CAF seemed to be an obvious match. An early version of a workhorse program
developed within our branch seemed to be a reasonable, if not perfect, choice. This program is
called COMPOSE and was written by Ram Mohan and Dale Shires. COMPOSE is a
composites manufacturing simulation code used to optimize the fabrication process. The HPF
version of the code consists of ~3000 lines of Fortran code, including 130 lines specific to HPF.
The assumption was that most of the effort in porting this program to CAF would center around
these 130 lines (primarily declaration statements). When it became clear that this assumption
was incorrect and that all 3000 lines would probably need to be modified (to provide access to
shared arrays), it was decided that this was beyond the scope of this project. Therefore, a new
project needed to be selected quickly.

Attempting to maintain the spirit of the original project, it was decided to measure the
performance of the NAS benchmarks on the Cray X1. This project was selected since these are
widely reported and highly respected benchmarks (1). Of equal importance, references on the
web indicated that a message passing interface (MPI) version of these benchmarks could be
downloaded from NASA Ames Research Center, a UPC version of some of these benchmarks
could be downloaded from George Washington University (GWU), and a CAF version of these
benchmarks could be downloaded from Rice University. Therefore, it was expected that, given
the limited amount of time remaining, we would be able to run all of these benchmarks for a
range of processor counts for at least some of the smaller benchmark sizes (e.g., W, A, and B).
Our results, conclusions, and the problems we ran into comprise the bulk of this report.


2.  The  Choice  of  Hardware  and  Related  Topics 


When selecting a platform on which to carry out this research, the obvious choice was a Cray
X1 or the newer Cray X1E. The U.S. Army High-Performance Computing Research Center
(AHPCRC), which is supported by both the U.S. Army and the DOD HPCMP, has a small
Cray X1 and a much larger Cray X1E. We requested and received access to the Cray X1, which
is reserved primarily for educational purposes. The compilers on this system support both UPC
and CAF. The system consists of 4 nodes, each with 16 single-streaming processors (SSPs) and
16 GB of main memory. Each SSP has a scalar unit rated at 400 MFLOPS and a vector unit
rated at 3.2 GFLOPS. Jobs may request processors in terms of either SSPs or multistreaming
processors (MSPs). In the latter case, each MSP consists of four tightly coupled SSPs, with the
compiler being called on to split the work among as many of the SSPs in each MSP as it can.
Additional information on the design of the Cray X1 can be found on the Cray website
http://www.cray.com.
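As a rough check on the figures quoted above, the nominal vector peak of this system can be worked out from the per-SSP ratings. The following Python sketch simply multiplies the quoted ratings by the processor counts. The variable and function names are illustrative, and the totals are nominal peaks, not measured performance.

```python
# Figures quoted in the hardware description above.
NODES = 4
SSPS_PER_NODE = 16
SSPS_PER_MSP = 4
SSP_VECTOR_GFLOPS = 3.2   # vector unit rating per SSP
SSP_SCALAR_GFLOPS = 0.4   # scalar unit rating per SSP (400 MFLOPS)


def peak_gflops(ssp_count, per_ssp_gflops=SSP_VECTOR_GFLOPS):
    """Nominal peak for a given number of SSPs (rating times count)."""
    return ssp_count * per_ssp_gflops


total_ssps = NODES * SSPS_PER_NODE           # SSPs in the whole system
total_msps = total_ssps // SSPS_PER_MSP      # the same processors counted as MSPs
msp_peak = peak_gflops(SSPS_PER_MSP)         # one MSP is four SSPs
system_peak = peak_gflops(total_ssps)

print(f"{total_ssps} SSPs = {total_msps} MSPs")
print(f"vector peak per MSP: {msp_peak:.1f} GFLOPS")
print(f"vector peak, whole system: {system_peak:.1f} GFLOPS")
```

Under these ratings, the whole system has 64 SSPs (16 MSPs), and one MSP has a nominal vector peak four times that of a single SSP.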

It should also be noted that the processors are grouped into nodes, with each node consisting of
16 SSPs (4 MSPs). Ordinarily, one of these nodes is reserved for use as an input/output (I/O)
and login node, which can also be used to run interactive jobs using up to 16 SSPs (4 MSPs).
The batch system has dedicated access to the other three nodes. We were given special access
to this system to run 49-processor jobs, but were unable to run 64-processor jobs. The main
concern here is that it is very difficult to show high levels of performance when using all of the
processors in a system (almost any system). There is enough interference from system
daemons that, even if everyone is asked to log out, one will rarely see anything approaching
linear speedup. In fact, most large systems have dedicated I/O nodes that may not even be
mentioned when discussing the system's size.

Another issue is the unusual approach that Cray takes to scheduling jobs on its systems. Jobs
of higher priority may actually cause a lower priority job to lose its processors for an extended
period of time. Even when dealing with jobs of equal priority, various forms of time sharing
seem to be taking place. This can range from having a job stream rolled out when a new job
starts up, to having two or more jobs actually time sharing their processors with varying time
quanta (probably ranging from under a second to many minutes in duration). The
documentation indicated that this could result in widely varying run times, but should not
significantly affect the CPU time. In some cases, the operating system left hints in the output
file as to ways to improve the run time. In some cases, we took advantage of these hints and
reran the jobs. In other cases, the effect seemed to be small enough that the hint was ignored
and work continued with other runs. In most cases, these effects were most pronounced
when using small numbers of processors (e.g., 1-4 SSPs or 1 MSP).

It should be noted at this point that our principal goal was getting our feet wet with
CAF and/or UPC. It was not anticipated that there would be sufficient time to worry
about modifying the code, or even putting in a modest number of compiler directives designed
to improve either vectorization or the performance of the code when running in MSP mode.
Therefore, it was expected that our runs would be made with the highest levels of optimization
(scalar, vector, and MSP) supported by the compiler, and we would just have to hope for the
best.


3.  Porting  COMPOSE  to  CAF


Since HPF and CAF are both extensions to the Fortran standard, it was assumed/hoped that
only modest changes would be required to port a program from HPF into CAF. This would
possibly be the case when dealing with a program using structured grids. However,
COMPOSE uses unstructured grids and was difficult to parallelize using HPF in the first place.
After spending several weeks working on this project, we developed serious misgivings about
it. From our slightly naive perspective, it appeared as though it would be necessary to use full
Co-Array syntax for virtually all of the array references in order to support a mix of local and
remote memory accesses on a seamless basis. In contrast, HPF had largely done that
automatically based on distribution statements in the declaration section of each program unit.

This was significantly more work than we had anticipated. Since parallelization is normally an
all-or-nothing proposition (some forms of shared memory parallelism, e.g., OpenMP, being
the principal exception), it was decided to look for another project to work on. It is possible
that it might have been easier to parallelize COMPOSE using UPC, but we did not look into
that possibility. Even if the parallelization had been straightforward, using UPC would have
had two distinct disadvantages. The first is that translating 3000 lines of Fortran code into C
code by hand is nontrivial (there are automatic Fortran-to-C translators, but their output is so
cryptic as to make the translated code unmaintainable). The second is that it can be
difficult to obtain a reasonable level of performance on a vector processor when using C code.
Therefore, it was decided to look for an entirely new project.


4.  The  New  Project:  Running  the  NAS  Benchmarks  on  the  Cray  X1


It was desirable that any new project have the following characteristics:

•  Use CAF and/or UPC,

•  Have another version parallelized using MPI and/or OpenMP available for
comparison purposes, and

•  Fit into the remaining time frame given the available human and computer resources.

Based on these criteria and Mr. Pressel's prior experience running the MPI version of the NAS
Parallel Benchmarks at the U.S. Army Research Laboratory Major Shared Resource Center
(ARL-MSRC), it was decided to try to run these benchmarks. A search of the web showed
that researchers at George Washington University (GWU) and elsewhere had already ported
several of the NAS benchmarks to UPC. A copy of these benchmarks was freely available from
a website at GWU. Similarly, a copy of a CAF version of the NAS benchmarks was freely
available from a website at Rice University. While the latter was based on a slightly older
version of the NAS benchmarks (version 2.3, while the most recent MPI version is 2.4), this did
not appear to be much of a problem.

An important reason for selecting this project is that the original software was developed by the
NAS Division at NASA Ames Research Center and is one of the most highly respected and
widely used sets of benchmarks for parallel computers (1). These benchmarks are based on the
needs of computational fluid dynamics applications, but appear to have relevance to other
disciplines as well. In Pressel and Jelani (2), a subset of the NAS benchmarks (BT, CG,
lower-upper decomposition [LU], and SP) was compared to the Linpack Parallel benchmark (3),
the STREAM benchmark (4), and peak processor speed (in MFLOPS). It was found that,
collectively, the subset of the NAS benchmarks was the best predictor of the performance of
applications in Computational Chemistry and Material Science, Climate/Weather/Ocean
Modeling and Simulation, Computational Fluid Dynamics, and Computational Structural
Mechanics when using 1-1152 processors on 15 different system types from six different
vendors.

The UPC versions of the NAS benchmarks (NAS-UPC) were easily downloaded from the
GWU site, and Mr. Pressel had previously downloaded the MPI versions of the NAS
Benchmarks (NAS-MPI) (version 2.4) for another project. Efforts to download the CAF
version of the benchmarks (NAS-CAF) from the website failed. However, with help from
Charles Koebel, we were able to obtain a copy of those benchmarks as well. It would be nice to
report that our troubles ended there, but unfortunately, this is not the case. Even though both of
us made a serious effort to compile and link both NAS-UPC and NAS-CAF, we kept meeting
with failure. Jeff Dawson of the AHPCRC provided additional guidance, but we just could not
get those benchmarks to compile and link no matter what we tried (including renaming the files
and making simple changes to the code in response to compiler error messages).

Eventually, guidance was received from GWU that the version of the NAS-UPC benchmarks
on their website is based on a later version of the UPC standard than that supported by
the Cray compilers we had access to, and that it would be impossible to run these benchmarks
on the Cray X1 (5). Apparently, another version of the NAS-UPC benchmarks does exist, but
it was not available for downloading at the time. There were many problems with the NAS-CAF
benchmarks, and the source of several of those problems remains a mystery.
Fortunately, the NAS-MPI benchmarks all compiled on the first try. Unfortunately, we were
unable to get the LU benchmark to run properly on the Cray X1, even when all optimizations
were disabled using the -O0 option.


5.  Procedure 


In general, Cray seems to recommend using MSP mode. While that may normally be the best
choice for well-tuned code, there were serious concerns that the
compiler would have trouble using all of the SSPs per MSP when running untuned code in MSP
mode. Therefore, it was decided that initially we would concentrate on running in SSP mode.
Eventually MSP mode runs were also made, but in many cases they were considerably slower
than SSP mode runs using a comparable number of processors (remember, 4 SSPs = 1 MSP, so
an example of a comparable number of processors is 4 MSPs vs. 16 SSPs).

In order to be as fair as possible, without actually adding compiler directives or in some other
manner tuning the benchmarks, it was decided to use the highest level of optimization possible.
The documentation for the Fortran compiler indicates that -O3 may in fact not be the highest
level of optimization possible. Therefore, the Fortran runs were compiled using
-O scalar3,vector3,ssp for SSP mode runs, and -O scalar3,vector3 for MSP mode runs. For MSP
mode runs, -O scalar3,vector3,stream3,aggress was also tried. However, it seemed to make little
or no difference in performance, and some jobs actually ran slightly slower with this combination
of options.

It was observed that the embarrassingly parallel (EP) benchmark ran substantially slower on the
Cray X1 than on other systems Mr. Pressel had previously benchmarked. Since the EP
benchmark spends almost all of its time calculating random numbers, it was assumed that the
compiler had failed to vectorize the random number generator. After carefully rereading the
documentation, it was found that the benchmarks come with several alternative random number
generators. One of the alternative random number generators (randdpvec) is computationally
less efficient than the default random number generator. However, since it is written in a
manner that supports automatic vectorization by the compiler, it was expected to significantly
improve the performance of the EP benchmark, and probably to a lesser extent the performance
of some of the other benchmarks as well. Therefore, all of the SSP and MSP runs were repeated
using this generator.
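Why a generator can do more arithmetic and still run faster on a vector processor can be illustrated with a generic skip-ahead scheme. The Python sketch below is not the randdpvec source. It assumes the linear congruential recurrence x(k+1) = a * x(k) mod 2^46 with a = 5^13 that is commonly documented for the NPB generator. The seed, the lane count, and the function names are illustrative choices.

```python
# Assumed constants: multiplier and modulus commonly documented for the NPB
# pseudorandom generator. The seed is an illustrative choice.
A = 5 ** 13
MOD = 2 ** 46
SEED = 271828183


def sequential(seed, count):
    """Generate values one at a time; each depends on the previous one."""
    out, x = [], seed
    for _ in range(count):
        x = (A * x) % MOD
        out.append(x)
    return out


def strided(seed, count, lanes):
    """Generate the same values in batches of `lanes` independent updates."""
    current = sequential(seed, lanes)        # one starting value per lane
    jump = pow(A, lanes, MOD)                # multiplier that skips `lanes` steps
    out = []
    while len(out) < count:
        out.extend(current)
        # Each lane is updated independently of the others, which is the
        # property a vectorizing compiler needs. The setup above is the
        # extra work that makes this form less efficient on a scalar unit.
        current = [(jump * x) % MOD for x in current]
    return out[:count]


# Both forms produce the identical sequence.
assert strided(SEED, 1000, 64) == sequential(SEED, 1000)
```

The sequential form carries a dependence from one value to the next, which prevents vectorization. The strided form removes that dependence within each batch at the cost of additional setup.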

The documentation for the C compiler indicated that -O3 was the maximum level of
optimization. Therefore, the only other option that was required was -h SSP, which specifies
SSP mode. Historically, C code has not performed well on vector processors. Furthermore, it is
not clear which sorting algorithms would be the best match for a vector processor. However, it
is likely that, as with the random number generator, the best choice for a vector processor would
not be the best choice for a scalar processor. Therefore, we did not expect particularly good
performance for the IS benchmark, which is written in C, on the Cray X1, and unfortunately
our expectations were met.

Ms. Thompson made her runs using the 16-processor (SSP) interactive partition, while Mr.
Pressel worked with Hung Nguyen of the AHPCRC to work out the details of how to submit
batch jobs. For both the interactive and batch jobs, there were a number of problems associated
with the way the memory system works. In particular, most jobs are limited to using 1 GB of
main memory per process when running in SSP mode (4 GB in MSP mode), and there is no
virtual memory. For interactive jobs, one can easily override this using either the -c or -m
options for mpirun (up to the limits of available memory on the interactive node, 16 GB in this
case). However, for batch jobs, the only solution to the problem is to use the -m exclusive
option in combination with the -N option. The -N option specifies how many processes will run
on each node, and the combination allows the processes to use as much memory as they need
(up to the available limits on the node, 16 GB in this case). Unfortunately, it frequently meant
that a dedicated node was required, when a more flexible system might have been able to make
better use of the resources.

An associated problem is that one must place a power-of-2 number of processes on each node.
In some cases, this meant that more nodes were needed than would have been required with a
more flexible system. In other cases, it actually prevented us from running some Class D jobs
(over time, several problem sizes, called classes, have been defined for the NAS benchmarks:
W, A, B, C, and D). They simply could not be scheduled on this system, even though with a
more flexible system it should have been possible to run some of these jobs. For example,
when scheduling a 25-processor job on the three nodes of the batch partition, the ideal
schedule would be to place 8 processes on each of two nodes, and 9 processes on the third node.
Instead, the best that could be done using only three of the nodes was to schedule 16 processes
on one node, and the remaining 9 processes on a second node. While it might have been
possible to run the Class D BT benchmark with the ideal schedule, the less-than-ideal schedule
lacked sufficient memory with which to run this job. It should be noted that in order to use
more than 1 GB of memory/SSP, one needs to request exclusive nodes when running in batch
mode. There was no way in which to run this job using all four nodes.
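The 25-process example above can be sketched in a few lines of Python. This is only a model of the constraint as described in the text. It assumes that the per-node count must be a power of 2, that nodes are filled in order with the remainder going to the last node, and that a node's 16 GB is shared equally by the processes placed on it. The function names are hypothetical and do not correspond to any scheduler interface.

```python
NODE_MEMORY_GB = 16      # main memory per node, from the hardware description
BATCH_NODES = 3          # nodes available to the batch system


def fill_in_order(nprocs, per_node):
    """Place `per_node` processes on each node until none remain."""
    if per_node < 1 or per_node & (per_node - 1):
        raise ValueError("per_node must be a power of 2")
    full, rest = divmod(nprocs, per_node)
    return [per_node] * full + ([rest] if rest else [])


def balanced(nprocs, nodes):
    """Spread processes as evenly as possible over `nodes` nodes."""
    base, extra = divmod(nprocs, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


def memory_per_process(layout):
    """Memory each process gets on the most heavily loaded node."""
    return NODE_MEMORY_GB / max(layout)


for label, layout in [
    ("ideal", balanced(25, BATCH_NODES)),
    ("-N 16", fill_in_order(25, 16)),
    ("-N 8", fill_in_order(25, 8)),
]:
    fits = len(layout) <= BATCH_NODES
    print(f"{label:6s} {layout} nodes={len(layout)} fits={fits} "
          f"GB/process={memory_per_process(layout):.2f}")
```

Under these assumptions, the balanced layout is [9, 8, 8], while 16 processes per node gives [16, 9] and leaves only 1 GB per process on the full node. Eight processes per node would need a fourth node, which the batch partition does not have.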

Since we were unable to run NPB-CAF and NPB-UPC, a literature search was conducted using
Google. Several papers were found that provided additional results for these benchmarks (6-8).
Results excerpted from those papers have been used to supplement our own results. Both
sources of results will be used in the remainder of this work.


6.  Results 


This project was intended to concentrate on learning as much as possible in the short time
available about CAF and/or UPC. Unfortunately, as was previously mentioned, we were unable
to get any of the available NAS benchmarks written in either of these languages to run on the
Cray X1. However, others with more time/expertise have fared better. Therefore, the
following two charts compare the performance of NPB-MPI, based on our measurements, to
that reported for NPB-CAF and NPB-UPC in the literature.

Figure 1 shows that the BT benchmark of version 2.4 of NPB-MPI (Class B) vectorized
poorly, if at all, on the Cray X1. This was a surprise since Saini and Bailey (9) clearly indicate
that an earlier version of this benchmark was able to achieve 20% of peak on the Cray C90.
After consulting with the NPB group at NAS, we were given access to a specially written
version of NPB-MPI. Clearly it performs much better. On the other hand, it is important to
note how poorly the UPC version of this benchmark performed. One should also note that, in
some cases, the MSP runs appear to have been using only one of the four SSPs per MSP,
resulting in their poor showing.


Figure 1. A comparison of the performance of different implementations/modes of running of
the NPB 2.4 BT benchmark on the Cray X1.

Similarly, figure 2 shows the results for the MG benchmark of NPB version 2.4 (Class B).
However, in this case, the compiler was able to achieve a significant level of vectorization for
NPB-MPI. Furthermore, the compiler was able to make efficient use of all four SSPs per MSP.
It is interesting to note that the per-SSP performance when running in MSP mode actually
exceeded that of runs done in SSP mode by a slight amount. The performance of NPB-CAF
(based on version 2.3 of NPB-MPI) appears to be competitive with that of NPB-MPI in this
case. The NPB-UPC benchmark when running in MSP mode was almost as good, although it
appears to be running into scalability problems. It is surprising how poorly the SSP mode run
for the NPB-UPC benchmark performed.

Unfortunately, for most of the other benchmarks, we were unable to find sufficient results for
NPB-CAF and/or NPB-UPC to make comparisons worthwhile. Figures 3-10
compare the results for each of the eight NPB-MPI version 2.4 (Class B) benchmarks on a
cross-platform basis (10). It should be noted that, for each benchmark, the SSP and MSP runs
were compared, and whichever set of runs showed the best performance at 32 processors (36 in
the cases of BT and SP) was used for these comparisons. These charts stop at
32 (36) processors since that was the limit of what could be run on the Cray X1 we were using.
In general, all of the remaining systems continued to show scaling for these benchmarks when
using larger numbers of processors. Presumably, the same would be true when using a larger
Cray X1.


Figure 2. A comparison of the performance of different implementations/modes of running of
the NPB 2.4 MG benchmark on the Cray X1.

Figure 3. Cross-platform performance comparison for the NPB Class B BT benchmark.

Figure 4. Cross-platform performance comparison for the NPB Class B CG benchmark.

Figure 5. Cross-platform performance comparison for the NPB Class B EP benchmark.

Figure 6. Cross-platform performance comparison for the NPB Class B FT benchmark.


Reviewing figures 3-10, there are some concerns over the scalability of these benchmarks on
the Cray X1. In general, given a fixed problem size, one will normally see issues with
scalability for large numbers of processors. However, for 32-36 processors and the Class B
NPB-MPI benchmarks, one rarely observes these problems. There are two ways to interpret
these data. Cray could still have problems with the implementation of MPI on the Cray X1
(something that they have previously warned about in their training classes). Alternatively,
when going to larger numbers of processors, the available parallelism for use in vectorizing the
code may be sufficiently limited as to result in suboptimal vector lengths (normally referred to
as short vectors).
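The second interpretation can be made concrete with a small Python sketch. It is purely illustrative and rests on several assumptions that are not taken from our measurements: a grid of 102 points per side (the Class B size for BT), a vectorized inner loop that runs over one grid side cut into sqrt(p) pieces, and a hardware vector length of 64 elements. The actual loops that the compiler vectorizes may differ.

```python
import math

GRID_POINTS = 102        # assumed points per side (NPB Class B grid for BT)
VECTOR_LENGTH = 64       # assumed hardware vector register length


def inner_loop_length(nprocs, grid_points=GRID_POINTS):
    """Longest inner loop if one grid side is cut into sqrt(nprocs) pieces."""
    pieces = math.isqrt(nprocs)
    if pieces * pieces != nprocs:
        raise ValueError("this sketch assumes a square process count")
    return math.ceil(grid_points / pieces)


def register_fill(length, vector_length=VECTOR_LENGTH):
    """Average fraction of each vector register used by a loop of `length`."""
    strips = math.ceil(length / vector_length)
    return length / (strips * vector_length)


for nprocs in (1, 4, 9, 16, 25, 36):
    length = inner_loop_length(nprocs)
    print(f"{nprocs:2d} processes: loop length {length:3d}, "
          f"average register fill {register_fill(length):.0%}")
```

Under these assumptions, the inner loop length drops from 102 on one process to 17 on 36 processes, so each vector operation would use only about a quarter of a 64-element register.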

Figure 7. Cross-platform performance comparison for the NPB Class B IS benchmark.

[Figure 8 plot. Horizontal axis: Number of Processors. Legend: SGI O3K 512 PE 400 MHz;
IBM SP Power3 375 MHz (NH2), single rail Colony Switch; IBM SP Power4 1.7 GHz, dual rail
Federated Switch; Intel Pentium 4 Cluster, Myrinet 2000, Intel Compiler; IBM SP Power4
1.3 GHz, dual rail Colony Switch; SGI Altix Itanium2 1.5 GHz; Cray X1 Air Cooled 400 MHz,
NPB3.1 pre-release.]

Figure 8. Cross-platform performance comparison for the NPB Class B LU benchmark.


Figures 11 and 12 show the parallel efficiency of the Cray X1 and three other systems for the
BT and LU benchmarks as a function of the problem size (Class). Ideally, the parallel
efficiency should be close to 1.0, and numbers below 0.70 (70%) are generally considered to be
undesirable. It is interesting to note that, regardless of the cause, the Cray X1 has a significantly
lower level of parallel efficiency than the other three systems. This helps to explain why, in
general, when looked at on a per-SSP basis, the Cray X1 is at best a good performer.
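For reference, parallel efficiency is conventionally defined as the speedup divided by the number of processors. The Python sketch below uses that standard definition with a single-processor baseline. The baseline behind figures 11 and 12 is not stated here, so this is an assumption, and the timings in the example are hypothetical values chosen only to show the arithmetic.

```python
UNDESIRABLE_BELOW = 0.70     # threshold quoted in the text


def speedup(t_serial, t_parallel):
    """Ratio of the baseline run time to the parallel run time."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, nprocs):
    """Speedup divided by processor count; 1.0 means linear speedup."""
    return speedup(t_serial, t_parallel) / nprocs


# Hypothetical timings (seconds), not measurements from this report.
t1, t36 = 360.0, 20.0
eff = parallel_efficiency(t1, t36, 36)
print(f"speedup {speedup(t1, t36):.1f}, efficiency {eff:.2f}, "
      f"undesirable: {eff < UNDESIRABLE_BELOW}")
```

In this hypothetical case, a speedup of 18 on 36 processors gives an efficiency of 0.50, which falls below the 0.70 threshold.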


Figure 9. Cross-platform performance comparison for the NPB Class B MG benchmark.

[Figure 10 plot. Horizontal axis: Number of Processors. Legend: SGI O3K 512 PE 400 MHz;
IBM SP Power3 375 MHz (NH2), single rail Colony Switch; IBM SP Power4 1.7 GHz, dual rail
Federated Switch; Intel Pentium 4 Cluster, Myrinet 2000, PGI Compiler -Mvect2; IBM SP
Power4 1.3 GHz, dual rail Colony Switch; SGI Altix Itanium2 1.5 GHz; Aircooled Cray X1
400 MHz, MSP Mode (SSP Equivalents).]

Figure 10. Cross-platform performance comparison for the NPB Class B SP benchmark.


7.  Observations  and  Conclusions 


Figure 11. Comparison of the parallel efficiency for different systems for the NPB 2.4 BT benchmark
when using 36 processors.

Figure 12. Comparison of the parallel efficiency for different systems for the NPB 2.4 LU benchmark
when using 32 processors.

Compared to HPF, CAF is not a simple extension to the Fortran programming language. UPC
is still an evolving target, which makes it difficult to write programs that are portable between
compilers. UPC, CAF, and HPF all appear to suffer from the same limitation: they are likely to
generate a large number of short messages. This makes programs written in those languages
much more dependent on the latency of the message passing system. On the Cray X1, the UPC
and CAF compilers are able to take advantage of the same low-latency hardware support as
used by calls to SHMEM. However, many competing systems are optimized for bandwidth, not
latency, and would be poor choices for use with these languages. Additionally, as processor
speeds increase, even a low-latency interface may appear to be expensive when measured in
terms of missed opportunities to start floating point operations. Therefore, the long-term
viability of these languages is uncertain. Additionally, since it is difficult to vectorize C code,
running UPC on the Cray X1 would appear to be a questionable proposition.
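The sensitivity to latency can be illustrated with the usual first-order cost model, in which each message costs a fixed latency plus its size divided by the bandwidth. The Python sketch below applies that model to two hypothetical interconnects. The latency and bandwidth figures are invented for illustration and do not describe the Cray X1 or any other system discussed here.

```python
def transfer_time(total_bytes, message_bytes, latency_s, bytes_per_s):
    """Time to move total_bytes as a series of message_bytes-sized messages."""
    messages = -(-total_bytes // message_bytes)      # ceiling division
    return messages * latency_s + total_bytes / bytes_per_s


PAYLOAD = 8_000_000          # bytes to move (hypothetical)
NETWORKS = {                 # name: (latency in seconds, bandwidth in bytes/s)
    "low latency": (1e-6, 1e9),
    "high bandwidth": (20e-6, 4e9),
}

for name, (latency, bandwidth) in NETWORKS.items():
    fine = transfer_time(PAYLOAD, 8, latency, bandwidth)        # one word per message
    bulk = transfer_time(PAYLOAD, PAYLOAD, latency, bandwidth)  # one large message
    print(f"{name:14s} word-sized messages: {fine:8.3f} s, "
          f"single message: {bulk:.5f} s")
```

With these made-up parameters, the bandwidth-optimized network wins when the data moves as one large message but loses badly when the same data moves one word at a time, which is the pattern these languages tend to generate.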

For some of the benchmarks when using NPB-MPI (CG, MG, and SP), MSP mode was the best
choice. For the remaining five benchmarks, SSP mode was a strong winner. We believe that,
regardless of which mode one chooses to use, it is best to use the equivalent number of SSPs as
the processor count when comparing performance with other systems on a per-processor basis.
It should be noted that this approach to reporting performance will in no way affect efforts to
report peak performance for a code, or related metrics.
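The recommended normalization amounts to converting every processor count to SSP equivalents before dividing. A minimal Python sketch follows. The function names are illustrative, and the performance figures in the example are hypothetical, not results from our runs.

```python
SSPS_PER_MSP = 4     # each MSP consists of four SSPs


def ssp_equivalents(count, mode):
    """Convert a processor count in SSP or MSP mode to SSP equivalents."""
    if mode == "SSP":
        return count
    if mode == "MSP":
        return count * SSPS_PER_MSP
    raise ValueError(f"unknown mode: {mode!r}")


def per_processor_rate(total_rate, count, mode):
    """Performance per SSP equivalent, for comparison with other systems."""
    return total_rate / ssp_equivalents(count, mode)


# Hypothetical aggregate rates in MFLOPS for the same benchmark.
runs = [("MSP", 8, 6400.0), ("SSP", 32, 5120.0)]
for mode, count, rate in runs:
    print(f"{count:2d} {mode}s = {ssp_equivalents(count, mode)} SSP equivalents, "
          f"{per_processor_rate(rate, count, mode):.0f} MFLOPS per SSP equivalent")
```

Both hypothetical runs use 32 SSP equivalents, so their per-processor rates can be compared directly with each other and with 32-processor runs on other systems.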

As was seen when the NPB-MPI version 3.1 (pre-release) BT benchmark was run, tuning for a
new class of computer architecture can significantly affect performance. It is likely that
additional tuning of the BT, CG, EP, and IS benchmarks could improve their performance on
the Cray X1. In some cases, this tuning might hurt the performance on nonvector systems, in
which case one might need to consider the desirability of supporting alternative versions of the
code. In other cases, it is likely that a single code base could be maintained, but that provisions
for additional parameters in either the make.def file or one of the include files would be required
to support customization of the implementation for the widest possible range of hardware.
Recommendations for some of these changes have been passed on to Haoqiang Henry Jin of the
NAS group at NASA Ames Research Center.